Analysis

Goals

The goal of this analysis is to find data outliers, interesting localized pieces of data or anything that jumps out as interesting when it comes to water boil notices in Texas.

Questions for analysis:

  • What counties and/or regions have the most amount of water boil notices?
  • What counties and/or regions have the longest lasting water boil notices?
  • What do ongoing water boil notices look like?
  • What are the most common reasons for water boil notices? Are these statewide or localized problems?
  • How many people in Texas are affected by water boil notices each year?

##Loading up tidyverse

Here I am loading all the different libraries I need in order to make a proper analysis.

library(tidyverse)
library(janitor)
library(readxl)
library(tigris)
library(sf)
library(mapview)

Importing Clean Data

Here I’m importing all of my clean data from my cleaning notebook.

water_boil <- read_rds("data-processed/01-water_boil.rds")

water_boil 

##Creating a graph

Because most of my data is by county, I want to look at where water boil notices are located and what they look like. In order to visualize this, I will use shake files for all Texas counties.

##counties <- counties(cb = TRUE, class = "sf")

counties <- st_read("data-raw/tl_2024_us_county/tl_2024_us_county.shp") |>
  filter(STATEFP == "48") |>
  mutate(COUNTY = str_to_upper(NAME))
Reading layer `tl_2024_us_county' from data source 
  `/Users/mariaprobert/rwd/UTMediaFellowship- water_project/water_boil_notices-mariaprobert/data-raw/tl_2024_us_county/tl_2024_us_county.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3235 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
Geodetic CRS:  NAD83
glimpse(counties)
Rows: 254
Columns: 20
$ STATEFP  <chr> "48", "48", "48", "48", "48", "48", "48", "48", "48", "48", "…
$ COUNTYFP <chr> "327", "189", "011", "057", "077", "361", "177", "147", "265"…
$ COUNTYNS <chr> "01383949", "01383880", "01383791", "01383814", "01383824", "…
$ GEOID    <chr> "48327", "48189", "48011", "48057", "48077", "48361", "48177"…
$ GEOIDFQ  <chr> "0500000US48327", "0500000US48189", "0500000US48011", "050000…
$ NAME     <chr> "Menard", "Hale", "Armstrong", "Calhoun", "Clay", "Orange", "…
$ NAMELSAD <chr> "Menard County", "Hale County", "Armstrong County", "Calhoun …
$ LSAD     <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
$ CLASSFP  <chr> "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "…
$ MTFCC    <chr> "G4020", "G4020", "G4020", "G4020", "G4020", "G4020", "G4020"…
$ CSAFP    <chr> NA, "352", "108", "544", NA, NA, NA, "206", "484", NA, NA, NA…
$ CBSAFP   <chr> NA, "38380", "11100", "38920", "48660", "13140", NA, "14300",…
$ METDIVFP <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FUNCSTAT <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
$ ALAND    <dbl> 2336237980, 2602109424, 2354617584, 1312947260, 2819873723, 8…
$ AWATER   <dbl> 613559, 246678, 12183672, 1361644522, 72504932, 118336455, 82…
$ INTPTLAT <chr> "+30.8852677", "+34.0684364", "+34.9641790", "+28.4417191", "…
$ INTPTLON <chr> "-099.8588613", "-101.8228879", "-101.3566363", "-096.5795739…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-99.7712 30..., MULTIPOLYGON (((…
$ COUNTY   <chr> "MENARD", "HALE", "ARMSTRONG", "CALHOUN", "CLAY", "ORANGE", "…

What counties and/or regions have the most amount of water boil notices?

Let’s look at the amount of water boil notices per each county in Texas. I experimented with shakefiles, importing visuals from Tableau and used the mapview() package. In the end, the mapview() package turned out to be the most engaging visualization format for this project.

To look at water boil notices I grouped my counties and looked at any appearances.

Counties:

water_boil_county <- water_boil |>
  group_by(county) |>
  summarize(appearances = n()) |> 
  arrange(desc(appearances)) |> 
  mutate( state = "TEXAS") 


water_boil_county

Download into CSV file:

write_csv(water_boil_county, "/Users/MariaProbert/rwd/water_boil_county.csv")

Then I reformatted my data so that I can apply it to a visualization:

graph_counties <- water_boil |>
group_by(county) |>
  summarize(total_notices = n()) |>
  mutate(county = str_to_title(county)) 

graph_counties

Here I am using the shakefile to prepare the mapview() visualization.

glimpse(counties)
Rows: 254
Columns: 20
$ STATEFP  <chr> "48", "48", "48", "48", "48", "48", "48", "48", "48", "48", "…
$ COUNTYFP <chr> "327", "189", "011", "057", "077", "361", "177", "147", "265"…
$ COUNTYNS <chr> "01383949", "01383880", "01383791", "01383814", "01383824", "…
$ GEOID    <chr> "48327", "48189", "48011", "48057", "48077", "48361", "48177"…
$ GEOIDFQ  <chr> "0500000US48327", "0500000US48189", "0500000US48011", "050000…
$ NAME     <chr> "Menard", "Hale", "Armstrong", "Calhoun", "Clay", "Orange", "…
$ NAMELSAD <chr> "Menard County", "Hale County", "Armstrong County", "Calhoun …
$ LSAD     <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
$ CLASSFP  <chr> "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "…
$ MTFCC    <chr> "G4020", "G4020", "G4020", "G4020", "G4020", "G4020", "G4020"…
$ CSAFP    <chr> NA, "352", "108", "544", NA, NA, NA, "206", "484", NA, NA, NA…
$ CBSAFP   <chr> NA, "38380", "11100", "38920", "48660", "13140", NA, "14300",…
$ METDIVFP <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FUNCSTAT <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
$ ALAND    <dbl> 2336237980, 2602109424, 2354617584, 1312947260, 2819873723, 8…
$ AWATER   <dbl> 613559, 246678, 12183672, 1361644522, 72504932, 118336455, 82…
$ INTPTLAT <chr> "+30.8852677", "+34.0684364", "+34.9641790", "+28.4417191", "…
$ INTPTLON <chr> "-099.8588613", "-101.8228879", "-101.3566363", "-096.5795739…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-99.7712 30..., MULTIPOLYGON (((…
$ COUNTY   <chr> "MENARD", "HALE", "ARMSTRONG", "CALHOUN", "CLAY", "ORANGE", "…
water_boil_graph <- counties |>
  left_join(graph_counties, by= c("NAME"= "county")) ##|>
 # filter(STATE_NAME == "Texas")
ggplot(water_boil_graph) + 
  geom_sf(
    aes(fill = total_notices), color = "white", size = 0.2
  ) + 
  scale_fill_gradient(low = "#335EF0",high = "#FF0000")+ 
  theme_void() + 
  labs(
    title = "Where are water boil notices located in Texas?", 
    subtitle = str_wrap("This chart looks at the concentration of water boil notices per county. Harris county has the most amount of water boil notices in the state."), 
    caption = "Source = Texas Commission on Environmental Quality", 
    fill = "Total Notices"
    ) 

  • save as image and call into summary ggsave –> object first then ggsave

Water Boil Notices Interactive Map

Using the shakefile and the water boil graph, I added these to the mapview() function and used a similar color scheme to show differences in water boil concentrations.

I used the mapview() package to make an interactive map. To look at the county name, just hover on the county and click for extra information.

mapview(water_boil_graph, zcol = "total_notices", col.regions= c("orange","#e67e22","#e74c3c", "#FF0000", "#7b241c", "#641e16"))

The county with the most amount of water boil notices is Harris county (Houston area), Montgomery (area above Houston), Bell (which makes up Temple, Killeen, etc.), and Brazoria county (South of Houston).

Data Takeaways:

  • East Texas or the surrounding Houston area is the area in Texas with the most amount of water boil notices. Harris county is one of the most populous counties in Texas.

TCEQ also has water boil notice data based on regions. These regions are specific to TCEQ and can be seen “here”.

Here I broke down the data by region.

Regions:

water_boil |>
  group_by(region) |>
  summarize(appearances = n()) |> 
  arrange(desc(appearances))

Data Takeaway:

  • Region 12 has the most amount of water boil notices at 1,716. Region 12 covers Houston. Again, Houston area is the region with the most amount of water boil notices.

What counties and/or regions have the longest lasting water boil notices?

I want to look at the counties with the longest lasting notice lengths, so let’s look at the average length to determine the county with the longest lasting notices. Then, I also want to record the minimum and maximum amount of days that water boil notices last in each county.

water_length <- water_boil |>
  group_by(county) |>
  summarize(
    average_notice_length = mean(notice_length, na.rm = TRUE),
    min_notice_length = min(notice_length, na.rm = TRUE), 
    max_notice_length = max(notice_length, na.rm = TRUE)
    ) |>
  arrange(desc(average_notice_length)) |>
  mutate( state = "TEXAS",   county = str_to_title(county))  |>
  mutate(average_days = as.numeric(average_notice_length)) |>
  mutate(max_days = as.numeric(max_notice_length)) |>
  mutate(min_days = as.numeric(min_notice_length))
Warning: There were 4 warnings in `summarize()`.
The first warning was:
ℹ In argument: `min_notice_length = min(notice_length, na.rm = TRUE)`.
ℹ In group 194: `county = "TERRY"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
water_length 
 water_length_shape <- counties |>
  left_join(water_length, by= c("NAME"= "county")) ##|>
  ##filter(STATE_NAME == "Texas")

Download into CSV file:

write_csv(water_length, "/Users/MariaProbert/rwd/water_length.csv")

Let’s look at the longest average water boil notice durations and the maximum and minimum water boil notice lengths per county. I used the same format as above and made sure to include the data and join it to the shapefile. I chose a different color scheme to differentiate the data.

mapview(water_length_shape, zcol= "average_days", col.regions= c("blue","#5a9ef5", "#83cbf6", "#5ae2f5")) 
Warning: Found less unique colors (4) than unique zcol values (171)! 
Interpolating color vector to match number of zcol values.

#Average Water Boil Notice Lenght

Data Takeaways:

  • None of the top three counties with the longest average amount of days per water boil notice show up in the top of the list of most water boil notices. But this makes sense. For example, places like East Austin have a higher population, therefore a higher demand for water and a higher demand for water boil notices to get resolved.

Let’s look at Live Oak (the county with the longest average water boil notices) more closely:

water_boil |>
filter(
  county == "LIVE OAK"
) |>
  select(county, population, issued, rescinded, notice_length, reason_activity, reason, affected_area_activity, affected_area, pws_name)

Live Oak county is located South of San Antonio and has a very small population. The water boil notices affect all residents in the public water system. The most recent water boil notice happened in June and lasted almost a year.

Now, I want to look at the top counties with the most water boil notices and check how long their water boil notices last.

Let’s look at Harris as an example:

water_boil |>
  group_by(county) |>
  summarize(
    average_notice_length = mean(notice_length, na.rm = TRUE),
    min_notice_length = min(notice_length, na.rm = TRUE), 
    max_notice_length = max(notice_length, na.rm = TRUE)
    ) |>
  arrange(desc(average_notice_length)) |>
  filter(
    county == "HARRIS"
  )
Warning: There were 4 warnings in `summarize()`.
The first warning was:
ℹ In argument: `min_notice_length = min(notice_length, na.rm = TRUE)`.
ℹ In group 194: `county = "TERRY"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

When I check for Harris we get NA, which means that there is still at least one pending water boil notice at the time that this data was collected. So the data can’t calculate the average, minimum and maximum length of water boil notices in the area.

To check the length of the water boil notices that have been rescinded, I added “na.rm = TRUE.” This removes all ongoing water boil notices and gives us a comparable average.

Data Takeaways:

  • Counties with the longest lasting water boil notices tend to have smaller populations. The top two counties with the longest lasting notices are located South of San Antonio.
  • Harris county has at least one pending water boil notice at the time this data was collected.

Regions:

Now let’s check regions with the longest lasting water boil notices:

water_length_region <- water_boil |>
  group_by(region) |>
  summarize(
    average_notice_length = mean(notice_length, na.rm = TRUE),
    min_notice_length = min(notice_length, na.rm = TRUE), 
    max_notice_length = max(notice_length, na.rm = TRUE)
    ) |>
  arrange(desc(average_notice_length)) 

water_length_region

If I don’t remove the NA examples, 12 out of the 16 regions designated by TCEQ in Texas have at least one ongoing water boil notice at the time this data was collected.

Data Takeaways:

  • Region 16 covers Laredo. The highest average water boil notice length is 29 days.
  • 12 out of the 16 regions designated by TCEQ in Texas have at least one ongoing water boil notice at the time this data was collected.

Maximum length of water boil notices

Now let’s look at the maximum length of water boil notices in the state of Texas. We will use the same data from above and the mapview() function.

mapview(water_length_shape, zcol= "average_days", col.regions= c("blue","#5a9ef5", "#83cbf6", "#5ae2f5")) 
Warning: Found less unique colors (4) than unique zcol values (171)! 
Interpolating color vector to match number of zcol values.

Minimum length of water boil notices

What do ongoing water boil notices look like?

The original results for the duration of the water boil notices gave us a lot of NA results, each NA represents water boil notices that are ongoing at the time the data was collected.

Let’s look at the amount of ongoing water boil notices in each county in the time span of Jan. 2022 to Jan. 2024.

water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  group_by(county) |>
 summarize(appearances = n()) |> 
 arrange(desc(appearances))

Data Takeaways: The counties with the most ongoing water notices are Harris, Johnson and Wise. Again, Harris county has a very high amount of water boil notices.

Now let’s look at the total ongoing water boil notices in Texas at the time this data was collected.

water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  summarize(appearances = n()) 

Data Takeaways: In the state of Texas, there were at least 69 total ongoing water boil notices (in all counties and regions) at the time that this data was collected.

Harris County water notices

Let’s look at the water boil notices in Harris:

water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  group_by(county) |>
  filter(county == "HARRIS")

Let’s add up the population affected by ongoing water notices in Harris county:

water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  group_by(county) |>
  filter(county == "HARRIS") |>
  summarise(harris_population_notice = sum(population))

Data Takeaways:

  • In Harris county at the time that this data was collected at least 13,594 people were affected by water boil notices.

Total Population affected by water boil notices

Let’s look at the total population affected by water boil notices in each county:

population_ongoing_notice <- water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  group_by(county) |>
  summarise(county_population_notice = sum(population)) |>
  arrange(desc(county_population_notice))

population_ongoing_notice

** MAP?

Data Takeaways:

  • The counties with the most population affected by water boil notices matches up to counties with a high population density. For example, Harris and the Houston area as well as Smith county and Tyler.

Let’s try this by region and look at populations under water boil notices by region:

water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  group_by(region) |>
 summarize(appearances = n()) |> 
 arrange(desc(appearances))

Data Takeaways: Region 12 which covers Harris county surpasses all other counties in terms of population affected.

Now let’s look at the total population in Texas under ongoing water boil notices:

water_boil |>
  filter(if_any(c(rescinded, notice_length), is.na)) |> 
  summarise(texas_population_notice = sum(population)) 

What are the most common reasons for water boil notices? Are these statewide or localized problems?

Let’s look at the main reasons why water notices happen in the state of Texas. For this I grouped for “reason_activity” and other columns that help explain the nature of the water boil notices.

water_boil |>
  group_by(reason_activity) |>
  summarize(appearances = n()) |> 
  arrange(desc(appearances))

Now let’s check to see if more of these issues are systemwide or more localized problems.

water_boil |>
  group_by(affected_area_activity) |>
  summarize(appearances = n()) |> 
  arrange(desc(appearances))

Note: BWN stands for boil water notice.

Data Takeaways:

  • Water boil notices are more related to systemwide issues, rather than localized. But, not by that much of a difference.

How many people in Texas are affected by water boil notices each year?

Excluding the ongoing water boil notices that we already looked at, lets look at the total population under resolved water boil notices between Jan. 2022 and Jan. 2024.

water_boil |>
  summarize(population_notice = sum(population))

Main Takeaways